   next_inactive up previous

                 Livermore Metagenomics Analysis Toolkit - LMAT

Contents

     * What's New in Version 1.2.4
     * Overview

          * Quick-Start

     * Required Software Packages
     * Compilation
     * Downloadable Databases
     * Searchable K-mer/Taxonomy Database Preparation

          * summary of steps required for building a database
          * How to update the taxonomy
          * prepare fasta file, and generate gi to tax ID mapping
          * run 'kmerPrefixCounter' to build reference genome database files
          * Tax Histo
          * Make DB Table

     * Custom Null Model Generation
     * Scoring and Classification

          * Query input files
          * Scoring (Labeling)
          * Classification
          * Read Analysis

     * Content Summarization
     * Gene identification
     * Advanced LMAT features

          * 16-bit ID Space
          * Taxonomy list pruning
          * Additional database creation feeds

     * AUSPICES STATEMENT

                          What's New in Version 1.2.4

     * Updated reference databases.
     * Support cross links between taxonomy profiles and gene label profiles.

     * Support for world-region classification.
     * Fixed bug in fs-report of genus for consistent output.

     * Fixed bug in reported k-mer count of input database.
     * New utility for output formatted for use with other tools.

   Earlier updates: Version 1.2.3

     * Database build supports human and select k-mer inputs.
     * Updated content summarization.
     * Permissive match feature.
     * Summary reporting at specific ranks. Additionally, genus and
       species/strain level reports generated by default are combined and
       formatted in html.
     * run_lmat.sh has been split into three shell scripts for each
       subcomponent of LMAT query processing: run_rl.sh for read labeling;
       run_gl.sh for gene labeling; run_cs.sh for content summarization.

   Version 1.2.2

     * Support for reading query input from stdin in read_label module. use
       ``-i -'' to enable this feature.
     * Fixed a bug where incorrect header information was included in LMAT
       raw output (individual read labels).
     * New aggresive plasmid assignment. Previously plasmids were
       under-reported due to their placement in the NCBI taxonomy heirarchy.
       The same plasmid associated with multiple strains will normally be
       classified as a species (or higher rank) call according to the LMAT
       lowest common ancestor placement logic. Now in addition to the
       standard LMAT classification procedure the highest scoring matches are
       saved. If the highest scoring match list includes a plasmid consistent
       with the initial LMAT classification, the final classification is
       switched to the plasmid. The drawback to this procedure is that
       multiple plasmids with tie scores are not explicitly represented in
       the final output label. LMAT, however, saves the scores of the
       competing matched sequences so a post processing step (not yet
       included) can easily tag plasmid calls that are non-specific to the
       initially reported taxonomy.

   Version 1.2.1

     * Support for full database download from the LMAT ftp site.
     * read_label no long re-prints the query input filename for every
       execution thread.

   Version 1.2 includes a number of major changes, which should be described
   in more technical detail in forthcoming publications. Changes include:

     * The kmer/taxonomy search database (DB) index structure was changed
       from gnu_hash to a custom designed index. This dramatically improves
       speed and dramatically reduces memory requirements
     * NCBI taxonomy identifiers are used instead of our own local database
       representation (should simplify interpretation)
     * The DB stores NCBI taxonomy identifiers in a 16-bit representation to
       substantially reduce memory requirements (our current Full database is
       372GB compared to the previous size of 540GB)
     * The read binning step is now included in the initial database search
       application to increase speed.
     * Added a content summarization step. Output from initial read labeling
       is re-processed to estimate abundance profiles and to assign each read
       (including higher ranked reads) to the nearest most likely genome of
       origin. A more detailed description of the method is given later in
       the documentation.
     * An gene annotation option is included.
     * The null-model has been improved select a GC context specific null
       model based on the GC content of the query read.
     * The run script for LMAT has been refined to better allow individually
       customized runs.
     * Input sets can now be ``load-balanced'' among multiple cores.
     * The module to label reads includes techniques for improving runtime
       with a slightly reduced accuracy by means of selectively pruning
       low-rank taxonomy identifiers.
     * The module to index the database can be enabled to use the same
       pruning technique (as above) to build databases with reduced numbers
       of taxonomy identifiers that should give even better runtime
       performance than runtime pruning.
     * Use latest je-malloc memory management library, which simplifies
       filemanagement
     * provide utility for retrieving organism specific reads and placing
       them into a single fasta file
     * The runtime-input files are now provided in a separate download from
       our ftp site. As a result, the source tarball is significantly
       smaller.
     * A script is provided to download databases and the runtime-input files
       from the ftp site.
     * Pre-compiled binary files are no longer being provided.

   Version 1.1.2 differs from LMAT-1.1.1 in only the following:

   We add support for null-model generation (Section 7) where random reads
   are now explicitly drawn from a range of GC content values.

                                    Overview

   This software package (LMAT) contains source code with the initial work
   described in "Scalable metagenomic taxonomy classification using a
   reference genome database". The web site for LMAT is
   http://computation-rnd.llnl.gov/lmat.

   The basic requirement is x86_64 Linux (tested with 2.6.32-358 or later
   kernels). Running on other 64-bit platforms is unconfirmed but feasible
   (please share your experiences). Our approach is divided into two steps:

   a.
           Searchable K-mer/Taxonomy Database Preparation:

           In somewhat simplistic terms, the database is a hashtable that
           maps k-mers to the genomes in which they appear, and the genomes
           to nodes in a taxonomy tree. (Hereafter we abbreviate "Searchable
           K-mer/Taxonomy Database" as DB.)

           Preparing a full DB involves several steps and takes many hours.
           ("Full" refers to a database that contains k-mers from all viral,
           prokaryote, fungal and protist genomes.)

           Our current ``Full'' database is 460 GB in size, smaller than what
           we used for our July 2013 publication in Bioinformatics due to
           several of the key optimizations within this release, yet larger
           than previous database releases due to a large increase in
           available reference sequences. We have several smaller ``marker''
           database available for ftp download. See the sec:downloads Section
           for more information about the various LMAT databases availble for
           download.

           An additional download required for running LMAT is the collection
           of ``runtime-input'' files needed to go along with the database
           (about 300MB in total). For convenience, we include a script to
           perform these ftp downloads, details below.

           Our databases are available for download from the ftp site:

           gdo-bioinformatics.ucllnl.org/pub/lmat.

           First download the external datafiles needed to run LMAT, these
           include the required NCBI taxonomy information, gene annotation
           and other support data.

           The most recent marker library is available with or without human
           k-mers included and are under 20 GB.

    cd <LMAT-root-dir>/bin

    # <name-of-input> reflects current taxonomy data used - current version is: 04072014
    ./get_db.sh --dtype=inputs --name=<name-of-input>  --outdir=<LMAT-root-dir>/runtime_inputs

    #  set the environment variable to point to the location
    setenv LMAT_DIR=<LMAT-root-dir>/runtime_inputs

    #  Download the marker libary, set <db-name> to kML.v4-14.20.g10.db
    #  or kML+Human.v4-14.20.g10.db

    ./get_db.sh --dtype=db --name=<db-name> <data-dir>

    #  Download the gene annotation libary, set <db-name> to
    #  lmat.genes.7-14.db (for gene annotation only, latest requires
    #  128 GB DRAM or SSD) ./get_db.sh <db-name> <data-dir>

    # Note that LMAT uses a .db extension for its database files,
    # omitted in the command syntax above

           Due to the time and effort involved in constructing a reference
           DB, we suggest that first-time users begin with our Marker
           database. In this case you can skip the sections "Searchable
           K-mer/Taxonomy Database Preparation", "Custom Null Model
           Generation.". Our sec:quick-start instructions give a brief
           example of how to perform classification using the downloadable
           Marker DB and included application binary executable.

           We include a shell script to download and assemble the larger
           marker library as it is broken into 5 separate compressed files.
           Run ./get_db.sh from LMAT-1.2.4/bin. Alternatively, each file can
           be downloaded separately to reduce the risk of incomplete
           downloads, then uncompressed and merged to create the whole marker
           database file.

           We highly recommend the use of a "ramdisk", Solid-State Drive, or
           other flash-based storage (eg. PCIe card) to store the
           uncompressed DB files. Make sure that there is enough space on the
           device to hold the relatively large file(s). Save these file(s) to
           a directory on that storage device and use gunzip to uncompress
           the *.gz.

           The files must remain with read-write ("rw") file access
           permissions set or LMAT cannot open them.

           Additionally, we encourage users to install the DI-MMAP runtime
           (optional) if considering use of LMAT databases on flash-based
           storage. Please see https://bitbucket.org/vanessen/di-mmap for
           more information. Accessing the database at run-time over a NFS
           mounted file system is not recommended.

             Tip: If a ramdisk is not available on a large (512GB+) server,
             we recommend that the database be ``precopied'' into the
             system's buffer cache (from network attached or local HDD
             storage) prior to running LMAT. LMAT uses memory mapped files so
             pages in the database are loaded on demand. Regular system
             prefetching of pages under the memory-mapped approach is slow
             unless run from an SSD. The effectiveness of precopy can depend
             on your system's buffer cache policy: some evict pages early so
             the buffer will not hold the LMAT database from the precopy to
             the following LMAT execution. To perform the ``precopy'':

             % cat <database-filename> > /dev/null.

   b.
           Scoring and Classification:

           K-mers in queries are searched for in the DB and taxonomic
           lineages are written to file; a PERL script subsequently analyzes
           this file and outputs classification and abundancy statistics. For
           more details, refer to the section sec:scoring.

Quick-Start

   This section gives instructions in brief on how to perform scoring and
   classification on an input query set using the Marker DB. The example in
   this section is covered in more detail in subsequent sections of this
   manual. Be sure you have successfully downloaded the reference database
   files as described above in sec:over-a (A) above. Once that is complete,
   you can run ``make'' to build binary executable application and run the
   scoring portion. To complete the example execution below, you need to
   provide:

   (1)
           an input .fasta file to classify.
   (2)
           a directory for output data files.

   Recommended configuration setting to use LMAT:

   disable address space randomization (if not already set).
   As super-user: echo 0 > /proc/sys/kernel/randomize_va_space
   -or-
   Use the ``setarch'' command using the -R option (see man page or setarch
   documentation online for more details.)

   Note this step is recommened to reduce any potential memory management
   conflicts. So far this step does not appear to be required and if address
   space randomization cannot be disabled, LMAT should still run.

   Example scoring and classification using LMAT with the Marker DB:

 # The environment variable LMAT_DIR must be set and point to the directory containing the files given in the runtime_inputs directory (see above for download information)
 # for example in bash:
 export LMAT_DIR=../runtime_inputs
 # or csh:
 #setenv LMAT_DIR ../runtime_inputs

 cd LMAT-1.2.4/

 # build binaries
 make

 #  create a temporary directory to store LMAT output
 mkdir tmp

 # unpack the simple example, includes 1000 reads from 3 organisms, and example LMAT output
 cd example
 tar zxvf example.tgz
 cd ..

 cd bin

 There are three main analysis commands:
 1) assign taxonomic labels to reads
 2A) assign gene labels to reads
 2B) compute k-mer abundance statistics
 Step 1 must be run prior to running the optional steps 2A and 2B. Steps 2A and 2B can be run independetly of each other.
 Note that output can vary slightly depending on the choice of the database.

 ## 1) Assign taxonomic labels to each read and report summary of organism contents

 ./run_rl.sh --db_file=<path-to-library/kML+Human.v4-14.20.g10.db --query_file=../example/simple_list.1000.fna --odir=../tmp --threads=8

 # Output files:
 simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary.species - Bins reads by species rank and reports the top strain match (where possible)
 simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary.plasmid - Bins reads assigned to plasmids.
 simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary.genus - Bin reads by genus rank
 simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary - Bins for the raw LMAT search results, read bins are in rank flexible categories
 simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.[0-7].out - individually taxonomically labeled reads; NOTE these files will take up substantial disk space as more runs are completed. They are retained for quick access for pulling reads of interest for further analysis


 ## 2A) Assign gene identifiers to each read a report sumary of gene contents
 find ../tmp -name simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output[0-9]\*.out > ../tmp/rl_output.flst
 ./run_gl.sh --db_file=<path-to-gene-database/gene.20mer.db --ilst=../tmp/rl_output.flst --odir=../tmp --filesum=../tmp/simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary

 # Output files:
 simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.ras.flst.gene.20mer.db.rl_output.0.1.20.genesummary - gene content summary
 simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.ras.flst.gene.20mer.db.rl_output[0-7].out - individually gene labeled reads
 # modifies .genus, .species, etc.

 ## 3) Run content summarization

 find ../tmp -name simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output[0-9]\*.out > ../tmp/rl_output.flst
 ./run_cs.sh --db_file=<path-to-gene-database/gene.20mer.db --ilst=../tmp/rl_output.flst --odir=../tmp --filesum=../tmp/simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary


 # Output files with supplementary information
 simple_list.1000.fna.kML.18mer.16mit.db.lo.rl_output.0.30.nomatchsum - Count reads not assigned a taxonomic ID
 simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary.summ.leftover - Count reads / taxnomic IDs exluded from the final *.summ file (due to not meeting minimum score/abundance thresholds)

 # Description of tab delimited columns for each type of output file

 *.fastsummary file - Weighted (by taxid assigned score) read count, non-weighted read count, NCBI taxonomy identifier, taxonomy rank and name
 *.species - Average read score, Total weighted read count, Count of reads assigned to organism, NCBI taxnomic species ID, taxonomic name, Total strain read score, Count of reads assigned to top strain, NCBI taxonomy strain ID, taxonomy name
 (When strain information is not reported it means no strain information could be recovered)
 *.genus - Average read score, Total weighted read count, Count of reads assigned to organism, NCBI taxnomic genus ID, taxonomic name
 *.plasmid - Average read score, Total weighted read count, Count of reads assigned to organism, taxnomic plasmid ID (frequently a custom LMAT ID), taxonomic name (frequently a fasta header from the original plasmid source sequence)

 *.nomatchsum - gives basic information on all reads not included in the taxonomic profile.

 ReadTooShort  -- Gives number of reads with too few valid k-mers (minimum of 30 required)
 NoDbHits -- Number of reads with no matching k-mers in the database.
 LowScore -- Number of reads assigned a taxonomic label with a score below the minimum threshold (0)

 *.genesummary - read count assigned to gene, LMAT NCBI taxonomic ID, NCBI taxonomic ID for source gene, NCBI Gene ID, NCBI Locus tag, description, gene type and protein accession.
 *.lo.rl_output[0-9]*.out - query read identifier, query read sequence, scoring statistics (average,standard deviation, # of k-mers), list of taxid,score pairs, final LMAT taxid call with score and match type. Match types are MultiMatch (lowest common ancestor), DirectMatch (best match), NoMatch (no database match).

 Note when run\_gl.sh is run - and the *fastsummary file is provided as input, the script will regenerate the *species file with additional gene information as follows:
     Column 4) Pcnt . rRNA - Percentage of reads in bin that were assigned to a rRNA gene.

     Column 5) No. Genes - Number of genes associated with the taxonomic bin. Note this number must be viewed with caution, since multiple closely related genes may be reported but may reflect a single gene.

     Column 6) No. Gene Reads - Number of reads in bin that were assigned to a gene.

     The original columns starting at column 4 are shifted over to the right.

 simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary.summ.RANK_kmer_cov

 Reports k-mer abundance distribution profiles for each taxonomic bin for the selected rank - RANK (e.g. species, genus etc.).
 The file reports for each taxononomic bin the number of distinct k-mers observed in the reads assigned to the bin and the total number of k-mers observed in the bin.
 With enough reads in a taxonomic bin, a distribution is also reported with 4 columns, for example:
 10995 14 1 50

 Where the first column is the NCBI identifier of the bin, the second column is the k-mer value (here k=14) and the third column reports the number of times k-mers were observed and the fourth column is the number of k-mers that were observed. In the example above, there were 50 distinct 14-mers that were found once.

 Collectively these k-mer statistics are meant to provide a measurement tool for assesing the coverage of taxonomic bin.

 simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary.ordered.RANK
 provides a modified version of the *.fastsummary.RANK files (where RANK is species,genus, etc.)

 Additional columns are included reporting k-mer coverage statistics.  Each additional tab delimited column is a 4-tuple of the (a,b,c,d) where a is the value of k for the k-mer, b is the second peak in the k-mer abundance profile taken from the *.RANK_kmer_cov file described above. Value c is the distinct k-mer count and d is the total count of k-mers assigned to the taxonomic bin.


 # Additional example run configurations

 # run just the full database
 ./run_rl.sh --db_file=<path-to-database>/lmat-4-14.20mer.db --query_file=<your-fasta-file> --odir=<path-to-output-directory>

   Note that the marker libraries have a smaller range of values making 0 (or
   values close to 0) the best cutoff. For the full libraries the range of
   values is larger giving the user some flexibility to choose higher cutoff
   values of 1 or 2 to select only the highest confidence tax calls (although
   0 is still a reasonable default cutoff). Choice of cutoff depends on the
   type of sample and analysis. The default (threshold of 0) is meant to
   maximize sensitivity to detect more distantly related organisms. It is
   recommended to run with 0 cutoff first, and sort the *.species file by the
   first column (average read score), which will give you an indication for
   the range of read scores and where to set the cutoff higher to reduce the
   potential for false positive calls.

   Check https://sourceforge.net/p/lmat/wiki/ for additional example
   information.

   If the pre-built executable does not load on your system, proceed with the
   sec:compile step below.

                           Required Software Packages

   gcc 4.4.X
   python 2.6
   OpenMP
   MPI (optional, for use in building a Reference Database)

                                  Compilation

    1. If using MPI (optional; only used in some drivers when building a DB
       from scratch) set up environment variable using your shell syntax;
       typically:

    export HAVE_MPI=1

   or:

    setenv HAVE_MPI 1

       (note: MPI is not used in the Scoring and Classification steps)

    2. LMAT-1.2.4/bin contains prebuilt executables for all c++ drivers. If
       these do not work on your system, cd to LMAT-1.2.4 and run make
    3. Disable address space randomization (if not already set).
       As super-user: echo 0 > /proc/sys/kernel/randomize_va_space

                             Downloadable Databases

   LMAT has several prebuilt databases available for download. Use the
   bin/get_db.sh utility to automatically perform downloads from the ftp
   site, mentioned above, using any of the following databases.

   Note that the database organization changed between releases. Current
   databases use the same file and key names.

   Current databases:

   +------------------------------------------------------------------------+
   |    Keyname/filename     |          Description           |File size(GB)|
   |-------------------------+--------------------------------+-------------|
   |                         |    Full-sized database for     |             |
   |   lmat-4-14.20mer.db    |     extensive read binning     |     460     |
   |                         |          (LMAT-Grand)          |             |
   |-------------------------+--------------------------------+-------------|
   |                         |Microbial marker database (small|             |
   |   kML.v4-14.20.g10.db   |  database for fast microbial   |     17      |
   |                         |           profiling)           |             |
   |-------------------------+--------------------------------+-------------|
   |                         | Microbial marker database with |             |
   |kML+Human.v4-14.20.g10.db|  explicit human read tagging   |     18      |
   |                         |    (small database for fast    |             |
   |                         |      microbial profiling)      |             |
   |-------------------------+--------------------------------+-------------|
   |   lmat.genes.7-14.db    |  Gene database for gene name   |     120     |
   |                         |            binning             |             |
   |-------------------------+--------------------------------+-------------|
   |    lmat-world-region    |Database for binning human reads|      2      |
   |                         |        by world region         |             |
   +------------------------------------------------------------------------+

   Previous databases (using 9/4/13 sequence data):

   +------------------------------------------------------------------------+
   |   Database key-name (for get_db   |         Filename         |File size|
   |              script)              |                          |  (GB)   |
   |-----------------------------------+--------------------------+---------|
   |          kML-18mer-large          | kML.18mer.no_prune.16.db |   80    |
   |-----------------------------------+--------------------------+---------|
   |         kML-18mer-medium          |    kML.18mer.16bit.db    |   52    |
   |-----------------------------------+--------------------------+---------|
   |          kML-18mer-small          |kML.18mer.16bit.reduced.db|   16    |
   |-----------------------------------+--------------------------+---------|
   |            kFull-20mer            | m9.20mer.16bit.g1000.db  |   373   |
   |-----------------------------------+--------------------------+---------|
   |            gene-20mer             |      gene.20mer.db       |   52    |
   +------------------------------------------------------------------------+

                 Searchable K-mer/Taxonomy Database Preparation

   Note: some drivers and scripts described below contain hardcoded paths, so
   execute from within the src directory.

summary of steps required for building a database

   This section gives a step-by-step overview of the steps required to build
   a searchable reference database for use with LMAT. This example assumes
   that you have already compiled the LMAT-1.2.4and that your working
   directory as specified below is also suitable for placing the database
   (eg. ramdisk or SSD). If not, is is fine to substitute "WORK" below with
   any suitable directory. The subsequent sections 6.2 - 6.6 expand upon
   these steps in more detail.

   cd <filepath>/LMAT-1.2.4/data

     * download gi_taxid_nucl.dmp.gz from ftp://ftp.ncbi.nih.gov/pub/taxonomy

   cd <filepath>/LMAT-1.2.4/src

 mkdir WORK

 #assume your reference genome fasta file is: WORK/test.fa.
 #Note to get started we provide a list of genome files from fall 2012, which we used for our initial database
 #the identifiers can be obtained here:  wget ftp://gdo-bioinformatics.ucllnl.org/pub/lmat/microbe2.gi.header_table.gz
 #currently the actual genomes sequences must be retrieved from NCBI

 Individual genome sequences must be mapped to a taxonomy identifier
 The mapping is specified as a tab delimited file with the first column containing the tax id and the second
 column should contain the header associated with sequence stored in the input fasta file (WORK/test.fa below)
 For example:
 418127   >ref|NC_009782.1|gnl|NCBI_GENOMES|21340|gi|156978331|Staphylococcus aureus subsp. aureus Mu3, complete genome

 build_header_table.py WORK/test.fa GenomeToTaxID.txt WORK
 ## WORK/test.fa.int file will be generated as part of the output

 The current version of LMAT requires 16-bit taxonomy identifers (to save space) so a 32-16 bit mapping must be created
 If the mapping from sequence fasta headers to tax ids is store in a file tax_gen_map.txt
 cut -f1 tax_gen_map.txt > tax_32bit_ids.txt
 Tid16_getMapping.py tax_32bit_ids.txt ../runtime_inputs/ncbi_taxonomy.data tax_32to16.txt
 The mapping will now be contained in the file tax_32to16.txt


  kmerPrefixCounter -i WORK/test.fa.int -k 20 -o WORK/kmerDB -l 1 -f 0
  kmerPrefixCounter -i WORK/test.fa.int -k 20 -o WORK/kmerDB -l 1 -f 1
  kmerPrefixCounter -i WORK/test.fa.int -k 20 -o WORK/kmerDB -l 1 -f 2
  kmerPrefixCounter -i WORK/test.fa.int -k 20 -o WORK/kmerDB -l 1 -f 3
  tax_histo -o WORK/out.tax_hist.0 -d WORK/kmerDB.0 -g WORK/test.fa.gi_to_tid_map -t ../runtime_inputs/ncbi_taxonomy.dat -f 32
  tax_histo -o WORK/out.tax_hist.1 -d WORK/kmerDB.1 -g WORK/test.fa.gi_to_tid_map -t ../runtime_inputs/ncbi_taxonomy.dat -f 32
  tax_histo -o WORK/out.tax_hist.2 -d WORK/kmerDB.2 -g WORK/test.fa.gi_to_tid_map -t ../runtime_inputs/ncbi_taxonomy.dat -f 32
  tax_histo -o WORK/out.tax_hist.3 -d WORK/kmerDB.3 -g WORK/test.fa.gi_to_tid_map -t ../runtime_inputs/ncbi_taxonomy.dat -f 32

  # generate the file containing list of binary source data files

  ls WORK/*.bin > WORK/table_input_list

  make_db_table -i WORK/table_input_list -o WORK/my_table -k 20 -f tax_32to16.txt

How to update the taxonomy

   This release contains a prebuilt file,
   LMAT-1.2.4/runtime_inputs/ncbi_taxonomy.dat, so you only need to rebuild
   if you're using sequences with gi numbers that are not in
   ncbi_taxonomy.dat. (LMAT-1.2.4/runtime_inputs/ncbi_taxonomy.dat was built
   from NCBI files downloaded on September 4, 2013)

   to rebuild:

   download taxdmp.zip from ftp://ftp.ncbi.nih.gov/pub/taxonomy and unzip;
   you can download these to a separate directory if you wish.

   >parse_ncbi_taxonomy.py <dir>

   where: dir> is directory containing uncompressed files from taxdmp.zip

   This will create three files:

   ../runtime_inputs/ncbi_taxonomy_rank.txt
   ../runtime_inputs/depth_for_ncbi_taxonomy.dat
   ../runtime_inputs/ncbi_taxonomy.dat

prepare fasta file, and generate gi to tax ID mapping

   a)
           Download gi_taxid_nucl.dmp.gz from
           ftp://ftp.ncbi.nih.gov/pub/taxonomy and place it in the
           LMAT-1.2.4/data directory. Note: this is a large file,
           approximately 700M. Do not uncompress this file.
   b)
           run: src/prepare_fasta.py fasta.fn <output-directory>

           where 'fasta.fn' contains your fasta formatted reference genomes.
           The sequence headers should contain NCBI gi numbers, else they
           will be discarded.

           Note users can retrieve bacterial, viral and plasmid data as
           follows:

    wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.fna.tar.gz
    wget ftp://ftp.ncbi.nih.gov/genomes/Viruses/all.fna.tar.gz
    wget ftp://ftp.ncbi.nih.gov/genomes/Plasmids/all.fna.tar.gz

           A list of files used in our original database, which includes
           eukaryote genomes is provided here:

    wget ftp://gdo-bioinformatics.ucllnl.org/pub/lmat/microbe2.gi.header_table.gz

           The <output-directory> will contain several files. fasta.fn.int is
           the input to the next step: kmerPrefixCounter.
           all_virus.gi_to_tid_map should replace gid_to_tid_kpath.txt for
           the classification phase.

     * Nice to know: prepare_fasta.py executes the following three codes:

 get_gi_numbers.py

     function: parses fasta header files to extract NCBI gi numbers;
     input: fasta file
     input: data/gi_taxid_nucl.dmp.gz
     output: file with one gi number per line

 build_gi_to_tid_map.cpp

     function: maps NCBI gi numbers to NCBI tax_id numbers
     input: output file from get_gi_numbers.py
     input: gi_taxid_nucl.dmp.gz
     output: file with 'gi tid' per line

 build_header_table.py

       transforms fasta file containing reference genome into format with a
       single id for header; the header is the NCBI taxonomy node ID
       associated with the sequences gi number. If the input file is
       'fasta.fa,' the output files will be:

     fasta.fa.int        - same as fasta.fa, but with transformed headers
     fasta.fa.gi.table   - each two-line entry contains
                            (1) gi number
                            (2) header from the input file
     fasta.fa.gi.table   - each two-line entry contains
                            (1) tax node ID;
                            (2) header from the input file

run 'kmerPrefixCounter' to build reference genome database files

   Usage for kmerPrefixCounter:

     -i <string>  - input fasta_fn
     -k <int>     - k-mer length
     -o <string>  - output filename
     -l <int>     - prefix length (as in: the string representation of the prefix)
     -f <int> [optional] first prefix

   This driver can build DB files for a large set of reference genomes by
   incrementally building files that only contain k-mers that begin with a
   given prefix. This driver can be run either sequentially or in parallel
   mode.

   If you invoke with '-l 1' you will have four output files; the first will
   contain (binary encoded) k-mers that begin with 'a', the second k-mers
   that begin with 'c,' and so on. For '-l 2' there will be 16 output files.
   In general, there will be $4^l$ output files.

   The output files contain k-mers, and each k-mer has a list of genome IDs
   (gi numbers) in which it appears.

   Sequential Example:


     kmerPrefixCounter my_fasta_file.int -k 18 -o kmer_and_genomes -l 1 -f 0
     kmerPrefixCounter my_fasta_file.int -k 18 -o kmer_and_genomes -l 1 -f 1
     kmerPrefixCounter my_fasta_file.int -k 18 -o kmer_and_genomes -l 1 -f 2
     kmerPrefixCounter my_fasta_file.int -k 18 -o kmer_and_genomes -l 1 -f 3

           this will produce four output files:

           kmer_and_genomes.0, kmer_and_genomes.1, etc.

   Parallel Example 1:
           Use your favorite utility to run kmerPrefixCounter jobs in
           parallel, eg. gnu parallel (in your path).

     seq 0 3 | parallel kmerPrefixCounter my_fasta_file.int -k 18 -o kmer_and_genomes -l 1 -f {.}

   Parallel Example 2:

           assuming you have a cluster with eight nodes:


     mpirun -np 8 kmerPrefixCounter my_fasta_file.int -k 18 -o kmer_and_genomes -l 2 -f 0
     mpirun -np 8 kmerPrefixCounter my_fasta_file.int -k 18 -o kmer_and_genomes -l 2 -f 8

           the first run output files kmer_and_genomes.0 through
           kmer_and_genomes.7, and the second outputs files
           kmer_and_genomes.0 through kmer_and_genomes.15

Tax Histo

   Run tax_histo to construct the k-mer/taxonomy database (discussed in
   Section 2.1 of our paper); run with no options for usage information.
   Output files -o used for running make_db_table.

Make DB Table

   Run 'make_db_table;' run with no options for usage information. For the
   required argument to specify input '-i <list-file>', <list-file> contains
   the list of binary data (.bin) files generated in step [*] above, eg.
   my_tax_histo_output.bin

   Once the execution has completed, the output filename specified by -o must
   remain with read-write ("rw") file access permissions set or the
   subsequent applications cannot open.

   When running sec:scoring you would specify "-d my_ref_db"

   To use make_db_table with values of $k$ (-k) other than 20, requires
   recompiling with alternate environment variables set. For 18-mers ($k=18$)
   set the variable SDBIDX=1827 prior to compilation. This sets up LMAT's
   two-level indexing to work with 18-mers. Other LMAT binaries do not
   require this change for use with 18-mer databases built in this way.

   For values of $k$ other than 18 or 20, set the environment variable
   USE_SORTED_DB=0 prior to compilation. All LMAT binaries must be built with
   this variable set to use values of $k$ other than 18 or 20. This enables
   LMAT indexing to use a hash table rather than two-level indexing. Future
   releases may support different $k$ databases without need for these
   particular compile-time changes.

                          Custom Null Model Generation

    1. Files needed for this step are placed in the rmodel subdirectory. Two
       scripts must be run to generate the model files. First, need to
       prepare a count of k-mers per every taxonomy identifier.
    2. frequency_counter: this executable takes as input the ascii output
       file from step [*].
    3. If run for multiple input files, use combine_counts.py to aggregate to
       a single file., eg:

 ls *.kcnt | python combine_counts.py > my_counts.kcnt

    4. Sort the output by count (column 2) in descending order.
    5. Run:

 gen_rand_mod.sh --db_file=<reference-database> --read_len=<integer> --tax_histo_count=<count-kmers.py-output> --filter=<cutoff-param>

       Generated files will have the prefix "null" and will be in the form:

       null.bin.10.<database-name>.<read-length>.<number-of-reads.rl_output.rand_lst.gz

       Lines of the file appear as:

       <kmers-per-line> <corresponding file>

       and ranked in ascending order

       Eg. for read length of 80 and using k=20, the line would appear as:

       61 null.bin.10.20merDB.80.480000.rl_output.rand_lst.gz

                           Scoring and Classification

Query input files

   Input query (read) files should be in fasta format, and sequences should
   be contained on a single line. If your sequences span multiple lines, they
   should be preprocessed with the reformatQueryFile.py script.

   The PhymmBL dataset reported on in our paper can be downloaded from:
   http://www.cbcb.umd.edu/software/phymm/exp_datasets/01__10_X_100_bp_test_sets/10_X_100_bp.tgz

Scoring (Labeling)

   The read_label driver searches the Reference Database for all k-mers in
   the queries, and outputs taxonomic lineages, as discussed in Section 2.2
   and Figure 2 in our paper. Output files have the format: <prefix>N.out -
   where N is 0 to $(numthreads-1)$ - even in single threaded execution.

   Input to read_label can come from stdin (supporing piped input) instead of
   an input file by using: -i -

   run_rl.sh is the wrapper script for running the LMAT read labeling
   pipeline; it takes several required and optional paramaters; run with no
   arguments for usage and example invocation.

Classification

   LMAT bins the reads that were assigned a taxonomic label with a score
   greater than or equal to the value specified by -x in the read_label
   (min_score variable in run_rl.sh) binary, the value defaults to 0. The
   output of the initial binning is in the file <prefix>.fastsummary, the
   file <prefix>.lineage, gives the output as human readable lineages that
   Krona http://sourceforge.net/p/krona/home/krona/ can process.
   <prefix>.lineage.html gives the Krona output (when Krona is installed on
   the user system).

   The user has the option re-bin the reads applying different thresholds,
   without re-running read_label binary.

    1. losummary_fast_mc.sh is script to re-bin reads using the different
       thresholds type losummary_fast_mc.sh for usage

   example usage:

 losummary_fast_mc.sh --file_lst=<list_of_output_files_from_read_label.txt> --min_score=<new minimum score>
 <min_score> if user wants to pick up more distantly related hits, try a value of -0.5 or -1.0,
 if the user wants to focus on high confidence hits try a value of 1.0

   A similar script is available for re-binning genes with a different
   threshold. type gsummary_mc.sh for usage.

Read Analysis

   A common post processing step is to retrieve reads for a specific organism
   or set of taxonomic IDs. pull_reads_mc.sh (type without command line
   arguments for usage) retrieves reads from LMAT output. Output can come
   either from the initial read_label output (the *.out output files) or the
   reads that were re-assigned to potentially more specific organisms during
   the content summary stage (these are the *.ras.1 output files) example
   usage:

 pull_reads_mc.sh --file_lst=<list of files containing read label output> --idfile=<file listing the taxids specify the reads to store in fasta files>

   You can specify multiple taxids to be placed in one fasta file, and/or
   place multiple taxids in distinct fasta files.

 Using the example included with the distribution try the following in your test working directory:
 echo 136845 > tid.txt
 ../bin/pull_reads_mc.sh --idfile=tid.txt --file_lst=simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.flst

   The script will create a fasta file with the selected reads, creating a
   filename from the input information. Output should be a fasta file with
   the name (names differ with the different choice of database)
   simple_list.1000.fna.kML+Human.v4-14.20.g10.db.tid.txt.136845.fna

   A second utility script is included called build_taxid_lst.sh, which
   allows the user to build a taxid list using a NCBI taxonomy string key
   word of the form rank,rank_value. For example kingdom,Virus will retrieve
   all reads explicitly tagged as viral reads in the user specified
   *.fastsummary files (output generated by LMAT's read_label and
   losummary_fast_mc.sh example usage:

 build_taxid_lst.sh --file_lst=fastsummary.file_lst --sstr=superkingdom,Viruses --idfile=viruses

   Using the example included with the distribution try the following in your
   test working directory:

 ls -1 *.fastsummary > fastsummary.file_lst
 ../bin/build_taxid_lst.sh --file_lst=fastsummary.file_lst --sstr=superkingdom,Viruses --idfile=viruses

   The script will create a text file with the taxonomic IDs, creating a
   filename from the input information. For the example above the output file
   should be:
   simple_list.1000.fna.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary.viruses.idtxt

   Most of the utility scripts that operate on sequencer reads require GNU
   parallel (http://www.gnu.org/software/parallel/), which is included with
   the LMAT distribution for conveniance.

   LMAT2multi-fastsummaryTable.pl : To convert .fastsummary, .species,
   .genus, or .plasmid files to tabular files for analysis with tools such as
   MEGAN or custom R scripts, use this script. It takes a list of LMAT
   summary files to a table with organisms as rows and metagenome samples as
   columns, with entries as either the fraction of reads or read counts. Run
   it without arguments for further instructions.

                             Content Summarization

   There are major changes to the content summary module (called content_summ
   in run_cs.sh). read_label generates the rank-flexible list of taxonomic
   IDs (TIDs), reporting the number of reads assigned to each TID and the
   weighted read count (the sum of scores from each read assignment). Reads
   inititially assigned to higher ranks, are no longer aggresively
   re-assigned to lower order ranks, based on the first pass assement of
   sample contents. This so far has proven to be an overly aggresive analysis
   - over reaching the amount of taxonomic information that could be
   extracted from individual reads. Instead, a more emperical reporting is
   attempted by reporting the k-mer coverage distributions for various values
   of k for each species call. The working hypothesis is that this
   information can be used to infer the coverage of individual genomes,
   obviating the need to map the reads to individual genomes and compute
   coverage statistics. This option is made available for further
   consideration.

                              Gene identification

   There is now a gene identification step. The user can download the gene
   database, which contains all microbial genes obtained from microbial
   genome database (current does not include all of NR). Reads are searched
   and the best gene match is reported. Matches are in DNA space. Output is
   placed in a file <prefix>.genesummary

                             Advanced LMAT features

16-bit ID Space

   The make_db_table utility contains an option to store taxonomy IDs as
   16-bit values. The downloadable databases are built with 16-bit taxonomy
   id fields. The instructions given above to use those reference databases
   refer to a file that maps traditional 32-bit taxonomy IDs (eg. from NCBI)
   to 16-bit IDs. This is in the run_rl.sh helper script or the '-f' option
   to the read_label binary executable.

   In order to use custom 16 bit IDs within a custom reference database, this
   mapping file needs to be generated from file containing the complete list
   of taxonomy IDs. If 32-bit IDs are adequate or necessary due to the ID
   space, you can skip most of these steps. However, you should ensure that
   you have compiled the LMAT binaries with the following environment
   variable set: TAXID_SIZE=32

   The complete list of taxids used in a database can be obtained from the
   output of the frequency_counter / combine_counts.py execution. Run
   Tid16_getMapping.py with that file as input. Output the mapping file (user
   chooses the filename) which is used in the following steps

   Ensure you have compiled (make) LMAT with the following enviroment
   variable set: TAXID_SIZE=16 Use the -f <filename> option with
   make_db_table to specify the filename when creating the database index.

   Run LMAT (read_label) with -f <filename> option to specify the mapping
   table or set the file in the run_rl.sh helper script.

Taxonomy list pruning

   The LMAT database indexes contain lists of taxonomy ids for each k-mer.
   Some lists are significantly long enough to affect LMAT classification
   performance (read_label). To mitigate this issue, LMAT has the ability to
   prune these lists of taxids to increase classification speed at the
   expense of some degree of lost accuracy.

   LMAT supports two types of pruning: (1) reduce lists of strain IDs to
   species only (faster) when list length exceeds a cutoff value; (2) prune
   any lower-rank taxid when list length exceeds the cutoff value. The
   following options are passed to the make_db_table executable: the cutoff
   value specified with -g <max taxid-list length> ; table file specified by
   -m <filename> ; -w toggles the strain-species filtering on. If set then
   (1) above is used and then use specific ``species map'' table file for -m.
   If -w is ommitted, then (2) above will be used and use the ``numeric
   rank'' file for -m. Sample files are available for LMAT downloadable
   databases.

   The LMAT software distribution comes with scripts to generate these files.
   build_tid_numeric_rank_table.py requires a ``taxonomy_rank.txt'' file as
   input and produces the ``numeric rank'' mapping table file for (2) above.
   build_species_level_map.py requires both the ``taxonomy_rank.txt'' and
   ``taxonomy.dat'' files to produce the table file for (1).

Additional database creation feeds

   The make_db_table utility included to produce the database index file
   (used by the LMAT classifier) has the ability to add human (ncbi: 9606)
   specific k-mers from an additional feed. K-mers in this feed should be in
   ascii form with out duplicates, lexicographically sorted, and in
   canonicalized form. Use -j command-line option to specify the input file
   for this feed of k-mers. Additionally, -c must specify the count of these
   k-mers.

   The second additional feed is used to specify k-mers from "synthetic
   construct" sequences that should map exclusively to that taxonomy id
   (ncbi: 32630). This feature will assist LMAT in identifying reads to such
   sequences without ambiguity given that these sequences often contaminate
   reference sequences. Use -u command-line option to specify the filename.
   As with the human k-mer feed, k-mers in this feed should be in ascii form
   with out duplicates, lexicographically sorted, and in canonicalized form.

                               AUSPICES STATEMENT

   This work performed under the auspices of the U.S. Department of Energy by
   Lawrence Livermore National Laboratory under contract DE- AC52-07NA27344

                            About this document ...

   Livermore Metagenomics Analysis Toolkit - LMAT

   This document was generated using the LaTeX2HTML translator Version 2008
   (1.71)

   Copyright (c) 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based
   Learning Unit, University of Leeds.
   Copyright (c) 1997, 1998, 1999, Ross Moore, Mathematics Department,
   Macquarie University, Sydney.

   The command line arguments were:
   latex2html -split 0 lmat-doc.tex

   The translation was initiated by Alexander K. Ames on 2014-09-22

   --------------------------------------------------------------------------

   next_inactive up previous

    Alexander K. Ames 2014-09-22
